The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.
This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of official sources and hosted on the Our World in Data website (Mathieu et al. 2021).
To run R and RStudio on Binder, click on this badge - .
Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.
For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, patchwork to include several plots in a single figure and a few others to assist in getting the data into shape.
To use these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.
# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
p_load(tidyverse, paletteer, glue, scales, plotly, lubridate, patchwork, visdat)More information on installing and using R packages can be found in this tutorial.
Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).
We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data directly from a file on the Our World in Data website into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which won’t automatically parse columns containing dates and in previous versions of R (<4.0) will slightly change the structure of the resulting data frame (all text columns will be converted into factors).
> Note that in this case, we need to specify the column types because the data contains a lot of missing values that interfere with the automatic parsing of the column types._
# read data from Our World in Data website
covid_data <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv",
col_types = paste(c("c", "f", "c", "D", rep("d", 29),
"c", rep("d", 33)), collapse = ""))Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column (see detailed information on each variable in the data repository on GitHub):
#explore the data frame
head(covid_data) # show first 10 rows of the data and type of variables## # A tibble: 6 × 67
## iso_code continent location date total_cases new_cases new_cases_smoot…
## <chr> <fct> <chr> <date> <dbl> <dbl> <dbl>
## 1 AFG Asia Afghanis… 2020-02-24 5 5 NA
## 2 AFG Asia Afghanis… 2020-02-25 5 0 NA
## 3 AFG Asia Afghanis… 2020-02-26 5 0 NA
## 4 AFG Asia Afghanis… 2020-02-27 5 0 NA
## 5 AFG Asia Afghanis… 2020-02-28 5 0 NA
## 6 AFG Asia Afghanis… 2020-02-29 5 0 0.714
## # … with 60 more variables: total_deaths <dbl>, new_deaths <dbl>,
## # new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## # new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## # total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## # new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## # icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## # hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …
We can also produce some descriptive statistics to better understand the data and the nature of each variable. The skim() function (from the skimr package) provides an alternative to the base summary() function and produces a summary of basic descriptive statistics, such as the mean, min, max, quantiles and even the SD and rate of missing data for continuous numerical values.
# summary of variables in my data
# summary(covid_data)
skim(covid_data)| Name | covid_data |
| Number of rows | 206852 |
| Number of columns | 67 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| Date | 1 |
| factor | 1 |
| numeric | 62 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| iso_code | 0 | 1.00 | 3 | 8 | 0 | 244 | 0 |
| location | 0 | 1.00 | 4 | 32 | 0 | 244 | 0 |
| tests_units | 100066 | 0.52 | 13 | 15 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-01-01 | 2022-08-07 | 2021-05-31 | 950 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| continent | 0 | 1 | FALSE | 7 | Afr: 47888, Eur: 44907, Asi: 44592, Nor: 31958 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| total_cases | 8381 | 0.96 | 3817978.55 | 2.396678e+07 | 1.00 | 3401.00 | 41533.00 | 446271.00 | 5.841024e+08 | ▇▁▁▁▁ |
| new_cases | 8666 | 0.96 | 12581.45 | 8.730694e+04 | 0.00 | 0.00 | 65.00 | 1013.00 | 4.079467e+06 | ▇▁▁▁▁ |
| new_cases_smoothed | 9843 | 0.95 | 12578.91 | 8.522271e+04 | 0.00 | 6.71 | 100.43 | 1149.43 | 3.437955e+06 | ▇▁▁▁▁ |
| total_deaths | 27155 | 0.87 | 69051.36 | 3.577979e+05 | 1.00 | 103.00 | 1032.00 | 9090.00 | 6.417879e+06 | ▇▁▁▁▁ |
| new_deaths | 27198 | 0.87 | 148.22 | 7.611400e+02 | 0.00 | 0.00 | 1.00 | 16.00 | 1.819100e+04 | ▇▁▁▁▁ |
| new_deaths_smoothed | 28365 | 0.86 | 148.93 | 7.446800e+02 | 0.00 | 0.14 | 1.71 | 17.29 | 1.481714e+04 | ▇▁▁▁▁ |
| total_cases_per_million | 9293 | 0.96 | 50068.28 | 9.138295e+04 | 0.00 | 909.94 | 8158.78 | 61425.71 | 6.553093e+05 | ▇▁▁▁▁ |
| new_cases_per_million | 9578 | 0.95 | 188.93 | 9.174400e+02 | 0.00 | 0.00 | 9.24 | 102.51 | 1.950053e+05 | ▇▁▁▁▁ |
| new_cases_smoothed_per_million | 10750 | 0.95 | 188.87 | 5.993700e+02 | 0.00 | 1.49 | 19.93 | 135.62 | 3.525884e+04 | ▇▁▁▁▁ |
| total_deaths_per_million | 28054 | 0.86 | 637.81 | 9.258300e+02 | 0.00 | 26.35 | 184.22 | 946.65 | 6.364670e+03 | ▇▁▁▁▁ |
| new_deaths_per_million | 28097 | 0.86 | 1.50 | 5.260000e+00 | 0.00 | 0.00 | 0.06 | 1.10 | 5.538000e+02 | ▇▁▁▁▁ |
| new_deaths_smoothed_per_million | 29259 | 0.86 | 1.50 | 3.450000e+00 | 0.00 | 0.00 | 0.23 | 1.49 | 1.486700e+02 | ▇▁▁▁▁ |
| reproduction_rate | 51822 | 0.75 | 0.97 | 3.800000e-01 | -0.05 | 0.77 | 0.98 | 1.17 | 5.860000e+00 | ▇▃▁▁▁ |
| icu_patients | 179515 | 0.13 | 859.31 | 2.501980e+03 | 0.00 | 33.00 | 170.00 | 619.00 | 2.889100e+04 | ▇▁▁▁▁ |
| icu_patients_per_million | 179515 | 0.13 | 22.24 | 2.672000e+01 | 0.00 | 4.08 | 11.81 | 31.58 | 1.803900e+02 | ▇▂▁▁▁ |
| hosp_patients | 177595 | 0.14 | 4210.63 | 1.097507e+04 | 0.00 | 163.00 | 814.00 | 3202.00 | 1.545130e+05 | ▇▁▁▁▁ |
| hosp_patients_per_million | 177595 | 0.14 | 160.29 | 1.958400e+02 | 0.00 | 31.38 | 89.38 | 207.72 | 1.546500e+03 | ▇▁▁▁▁ |
| weekly_icu_admissions | 200074 | 0.03 | 439.40 | 5.949800e+02 | 0.00 | 42.00 | 206.00 | 603.00 | 4.838000e+03 | ▇▁▁▁▁ |
| weekly_icu_admissions_per_million | 200074 | 0.03 | 13.71 | 1.544000e+01 | 0.00 | 3.66 | 8.69 | 17.93 | 2.229000e+02 | ▇▁▁▁▁ |
| weekly_hosp_admissions | 193415 | 0.06 | 5637.07 | 1.362218e+04 | 0.00 | 372.00 | 1339.00 | 5376.00 | 1.539880e+05 | ▇▁▁▁▁ |
| weekly_hosp_admissions_per_million | 193415 | 0.06 | 99.81 | 1.004500e+02 | 0.00 | 27.50 | 72.50 | 134.91 | 6.589400e+02 | ▇▂▁▁▁ |
| total_tests | 127467 | 0.38 | 21093607.55 | 8.404015e+07 | 0.00 | 364660.00 | 2067330.00 | 10248295.00 | 9.214000e+09 | ▇▁▁▁▁ |
| new_tests | 131451 | 0.36 | 67283.91 | 2.477363e+05 | 1.00 | 2244.00 | 8783.00 | 37228.00 | 3.585563e+07 | ▇▁▁▁▁ |
| total_tests_per_thousand | 127467 | 0.38 | 916.22 | 2.168240e+03 | 0.00 | 43.77 | 234.26 | 891.93 | 3.292590e+04 | ▇▁▁▁▁ |
| new_tests_per_thousand | 131451 | 0.36 | 3.24 | 8.970000e+00 | 0.00 | 0.29 | 0.97 | 2.91 | 5.340100e+02 | ▇▁▁▁▁ |
| new_tests_smoothed | 102889 | 0.50 | 142176.50 | 1.138225e+06 | 0.00 | 1486.00 | 6570.00 | 32204.00 | 1.476998e+07 | ▇▁▁▁▁ |
| new_tests_smoothed_per_thousand | 102889 | 0.50 | 2.80 | 7.260000e+00 | 0.00 | 0.20 | 0.86 | 2.58 | 1.476000e+02 | ▇▁▁▁▁ |
| positive_rate | 110932 | 0.46 | 0.10 | 1.200000e-01 | 0.00 | 0.02 | 0.06 | 0.14 | 1.000000e+00 | ▇▁▁▁▁ |
| tests_per_case | 112794 | 0.45 | 2397.00 | 3.349475e+04 | 1.00 | 7.10 | 17.50 | 53.70 | 1.023632e+06 | ▇▁▁▁▁ |
| total_vaccinations | 149596 | 0.28 | 258320512.19 | 1.079031e+09 | 0.00 | 942245.75 | 7791557.50 | 51941563.25 | 1.239910e+10 | ▇▁▁▁▁ |
| people_vaccinated | 152217 | 0.26 | 122220465.03 | 5.093437e+08 | 0.00 | 536085.00 | 4311020.00 | 25961193.00 | 5.311033e+09 | ▇▁▁▁▁ |
| people_fully_vaccinated | 154715 | 0.25 | 103974711.49 | 4.481811e+08 | 1.00 | 458891.00 | 3854304.00 | 22353415.00 | 4.880663e+09 | ▇▁▁▁▁ |
| total_boosters | 177954 | 0.14 | 49079214.10 | 2.037780e+08 | 1.00 | 52759.25 | 1670522.50 | 12824204.75 | 2.296682e+09 | ▇▁▁▁▁ |
| new_vaccinations | 159924 | 0.23 | 1026683.71 | 3.854329e+06 | 0.00 | 4761.75 | 36923.50 | 269547.25 | 4.966707e+07 | ▇▁▁▁▁ |
| new_vaccinations_smoothed | 91004 | 0.56 | 433909.63 | 2.463675e+06 | 0.00 | 701.00 | 7009.50 | 53783.25 | 4.368788e+07 | ▇▁▁▁▁ |
| total_vaccinations_per_hundred | 149596 | 0.28 | 93.52 | 7.560000e+01 | 0.00 | 20.79 | 85.12 | 150.95 | 3.668700e+02 | ▇▅▃▁▁ |
| people_vaccinated_per_hundred | 152217 | 0.26 | 44.73 | 3.010000e+01 | 0.00 | 14.23 | 48.80 | 71.82 | 1.287800e+02 | ▇▅▇▅▁ |
| people_fully_vaccinated_per_hundred | 154715 | 0.25 | 39.69 | 2.932000e+01 | 0.00 | 9.45 | 41.17 | 65.80 | 1.267900e+02 | ▇▃▆▂▁ |
| total_boosters_per_hundred | 177954 | 0.14 | 22.02 | 2.360000e+01 | 0.00 | 0.53 | 12.07 | 40.15 | 1.342700e+02 | ▇▃▂▁▁ |
| new_vaccinations_smoothed_per_million | 91004 | 0.56 | 2730.81 | 3.655220e+03 | 0.00 | 395.00 | 1466.50 | 3810.00 | 1.178620e+05 | ▇▁▁▁▁ |
| new_people_vaccinated_smoothed | 91903 | 0.56 | 165641.43 | 1.010110e+06 | 0.00 | 210.00 | 2386.00 | 19121.00 | 2.106905e+07 | ▇▁▁▁▁ |
| new_people_vaccinated_smoothed_per_hundred | 91903 | 0.56 | 0.12 | 2.200000e-01 | 0.00 | 0.01 | 0.04 | 0.14 | 1.179000e+01 | ▇▁▁▁▁ |
| stringency_index | 51666 | 0.75 | 50.24 | 2.168000e+01 | 0.00 | 35.19 | 49.92 | 67.59 | 1.000000e+02 | ▂▆▇▆▂ |
| population | 1229 | 0.99 | 143216682.11 | 6.961334e+08 | 47.00 | 896007.00 | 7494578.00 | 33573874.00 | 7.909295e+09 | ▇▁▁▁▁ |
| population_density | 22778 | 0.89 | 457.35 | 2.109000e+03 | 0.14 | 37.73 | 88.12 | 214.24 | 2.054677e+04 | ▇▁▁▁▁ |
| median_age | 36286 | 0.82 | 30.64 | 9.070000e+00 | 15.10 | 22.30 | 30.60 | 39.10 | 4.820000e+01 | ▇▆▇▆▆ |
| aged_65_older | 38089 | 0.82 | 8.82 | 6.130000e+00 | 1.14 | 3.53 | 6.70 | 14.18 | 2.705000e+01 | ▇▃▂▂▁ |
| aged_70_older | 37179 | 0.82 | 5.56 | 4.160000e+00 | 0.53 | 2.06 | 4.21 | 8.68 | 1.849000e+01 | ▇▃▂▂▁ |
| gdp_per_capita | 37275 | 0.82 | 19602.11 | 2.056391e+04 | 661.24 | 4449.90 | 12951.84 | 27936.90 | 1.169356e+05 | ▇▂▁▁▁ |
| extreme_poverty | 96239 | 0.53 | 13.63 | 2.003000e+01 | 0.10 | 0.60 | 2.20 | 21.20 | 7.760000e+01 | ▇▁▁▁▁ |
| cardiovasc_death_rate | 36710 | 0.82 | 260.96 | 1.199300e+02 | 79.37 | 170.05 | 243.96 | 329.94 | 7.244200e+02 | ▇▇▃▁▁ |
| diabetes_prevalence | 28270 | 0.86 | 8.37 | 4.690000e+00 | 0.99 | 5.31 | 7.20 | 10.59 | 3.053000e+01 | ▇▇▂▁▁ |
| female_smokers | 78160 | 0.62 | 10.66 | 1.060000e+01 | 0.10 | 1.90 | 6.30 | 19.30 | 4.400000e+01 | ▇▂▂▁▁ |
| male_smokers | 79923 | 0.61 | 32.80 | 1.353000e+01 | 7.70 | 21.60 | 31.40 | 41.30 | 7.810000e+01 | ▆▇▆▂▁ |
| handwashing_facilities | 123295 | 0.40 | 50.88 | 3.185000e+01 | 1.19 | 20.86 | 49.84 | 82.50 | 1.000000e+02 | ▇▅▅▅▇ |
| hospital_beds_per_thousand | 55673 | 0.73 | 3.09 | 2.550000e+00 | 0.10 | 1.30 | 2.50 | 4.00 | 1.380000e+01 | ▇▃▂▁▁ |
| life_expectancy | 13348 | 0.94 | 73.65 | 7.450000e+00 | 53.28 | 69.50 | 75.05 | 79.07 | 8.675000e+01 | ▁▃▅▇▅ |
| human_development_index | 41146 | 0.80 | 0.73 | 1.500000e-01 | 0.39 | 0.60 | 0.74 | 0.84 | 9.600000e-01 | ▂▅▅▇▇ |
| excess_mortality_cumulative_absolute | 199831 | 0.03 | 44438.78 | 1.238340e+05 | -37726.10 | 12.10 | 5114.90 | 31461.00 | 1.219078e+06 | ▇▁▁▁▁ |
| excess_mortality_cumulative | 199831 | 0.03 | 9.62 | 1.377000e+01 | -28.45 | 0.18 | 7.07 | 15.10 | 7.655000e+01 | ▁▇▂▁▁ |
| excess_mortality | 199813 | 0.03 | 14.73 | 2.708000e+01 | -95.92 | -0.30 | 7.25 | 20.64 | 3.759800e+02 | ▂▇▁▁▁ |
| excess_mortality_cumulative_per_million | 199831 | 0.03 | 1219.06 | 1.628300e+03 | -1694.15 | 7.01 | 679.86 | 1915.23 | 9.726080e+03 | ▇▆▂▁▁ |
# find extreme rows
covid_data %>% arrange(desc(total_cases))## # A tibble: 206,852 × 67
## iso_code continent location date total_cases new_cases new_cases_smoot…
## <chr> <fct> <chr> <date> <dbl> <dbl> <dbl>
## 1 OWID_WRL "" World 2022-08-06 584102431 1174416 1039605.
## 2 OWID_WRL "" World 2022-08-05 582928015 912410 953741.
## 3 OWID_WRL "" World 2022-08-04 582015605 1429455 1019857.
## 4 OWID_WRL "" World 2022-08-03 580597930 1146509 969115.
## 5 OWID_WRL "" World 2022-08-02 579451421 1170931 1006328.
## 6 OWID_WRL "" World 2022-08-01 578280490 910073 1006651.
## 7 OWID_WRL "" World 2022-07-31 577370417 533440 1023214.
## 8 OWID_WRL "" World 2022-07-30 576836977 573370 1025368.
## 9 OWID_WRL "" World 2022-07-29 576263607 1375222 1034208.
## 10 OWID_WRL "" World 2022-07-28 574888391 1074261 991894.
## # … with 206,842 more rows, and 60 more variables: total_deaths <dbl>,
## # new_deaths <dbl>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## # new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## # total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## # new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## # icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## # hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …
What are the metadata columns that describe our observations?
continent
location
date
Why do we have observations with missing continent?
# check which location have continent as NA
covid_data %>% filter(continent=="" | is.na(continent)) %>% count(location)## # A tibble: 13 × 2
## location n
## <chr> <int>
## 1 Africa 906
## 2 Asia 928
## 3 Europe 927
## 4 European Union 927
## 5 High income 928
## 6 International 912
## 7 Low income 896
## 8 Lower middle income 928
## 9 North America 928
## 10 Oceania 925
## 11 South America 897
## 12 Upper middle income 928
## 13 World 928
Some rows contain summarised data of entire continents/World, we'll need to remove those
We can see that most of our data contains ‘0’ (check the difference between the median and the mean in total_cases and total_deaths columns).
Just to confirm that, let’s plot a histogram of all the confirmed cases
ggplot(covid_data, aes(x=total_cases)) +
geom_histogram(fill="lightskyblue") +
theme_bw(def_text_size)The data is evolving over days (a time-series), to there’s no point treating it as a random population sample.
Let’s look at confirmed cases and total deaths data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:
filter(!is.na(continent)group_by()arrange(desc(date))slice(1) and remove grouping with ungroup()inner_join()Optional step:
location variable as a factor and order it so the countries will be ordered in the legend by the number of casesThen we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.
# find the 10 most affected countries (to date)
latest_data <- covid_data %>% filter(!is.na(continent), continent!="") %>%
group_by(location) %>% arrange(desc(date)) %>% slice(1) %>% ungroup()
most_affected_countries <- latest_data %>%
arrange(desc(total_cases)) %>% slice(1:10) %>%
select(location)
# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>%
inner_join(most_affected_countries) %>%
mutate(Country=factor(location, levels = most_affected_countries$location))
# create a line plot the data of total cases
ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total COVID-19 cases") +
theme_bw(def_text_size)We can look at the exponential increase at the early days of the pandemic if we transform the y-axis to logarithmic scale (and also improve how of the dates appear in the x-axis).
# better formatting of date axis, log scale
plot <- ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
geom_line(size=0.75) + scale_y_log10(labels=comma) +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total COVID-19 cases") +
theme_bw(def_text_size)
# show an interactive plot
ggplotly(plot)Why did we get a warning message and why the graphs don’t start at the bottom of the x-axis? How can we solve it? What can we infer from the graph (exponential increase)?
What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function (or add a very small number to them)?
We can see a very similar trend for most countries and while the curve has flattened substantially in 2021, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October 2020 and then again around Feb 2022 (with a huge rise in cases in South Korea).
Let’s have a look at the number of vaccinated people.
# vaccination rates
ggplot(most_affected_data, aes(x=date, y=people_vaccinated, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y="People Vaccinated") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())The graphs are “broken”, meaning that it is not continuous and we have some missing data.
Let’s visualise some of the variables in our data and assess “missingness”.
# visualise missingness
vis_dat(covid_data %>% filter(date>dmy("01-01-2022")) %>%
select(continent, location, total_cases, total_deaths,
hosp_patients, people_vaccinated, people_fully_vaccinated))# find which countries has the most number of observations (least missing data)
# covid_data %>% filter(!is.na(continent), continent!="", !is.na(people_vaccinated)) %>% # group_by(location) %>%
# count(location) %>% arrange(desc(n)) %>% print(n=30)
#
# covid_data %>% filter(!is.na(continent), continent!="", !is.na(hosp_patients)) %>% # group_by(location) %>%
# count(location) %>% arrange(desc(n)) %>% print(n=30)Now we can focus on a subset of countries that have more complete vaccination and hospitalisation rates data, so we could compare them.
countries_subset <- c("Italy", "Canada", "Israel", "United Kingdom", "France", "Australia", "Malaysia", 'South Africa', "India")
# subset our original data to these countries
hosp_data <- covid_data %>% filter(location %in% countries_subset)
# define a new colour palette for these countries
col_pal <- "ggsci::category10_d3"Let’s look at hospitalisation rates first.
ggplot(hosp_data, aes(x=date, y = hosp_patients,colour=location)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Hospitalised patients") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())Can you identify the “waves” in each country?
It’s hard to see the details in the countries with lower number of hospitalised patients, how can we improve the visualisation?
Look at hospitalision rates proportional to the population size!
# hosp per population size
p1 <- ggplot(hosp_data,
aes(x=date, y = hosp_patients_per_million,colour=location)) +
geom_line(size=0.75) +
scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Hospitalised patients (per million)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p1Now let’s try to compare this to vaccination rates
# total vaccination per population
p2 <- ggplot(hosp_data,
aes(x=date, y = people_fully_vaccinated_per_hundred/100 ,colour=location)) +
geom_line(size=0.75, linetype="dashed") +
guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
scale_y_continuous(labels=percent) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Rate of fully vaccinated people") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p2What will be the best way to compare these values? Maybe like this (note that this syntax is used by patchwork package to arrange the plots):
(p1 + guides(color="none")) / (p2 + theme(legend.position = "bottom")) #+ plot_layout(guides = 'collect')# show graphs on top of each otherAny other suggestions?
Let’s try to present them on the same graph (note the trick with the secondary y-axis), but for each country separately (done with the facet_wrap() function)..
# show on the same graph
p4 <- ggplot(hosp_data,
aes(x=date, colour=location)) +
geom_line(aes(y = hosp_patients_per_million), size=0.75) +
geom_line(aes(y = people_fully_vaccinated_per_hundred*10), size=0.75, linetype="dashed") +
scale_y_continuous(name = "Hospitalised patients per million (solid)",
# Add a second axis and specify its features
sec.axis = sec_axis(trans=~./10,
name="Fully vaccinated per hundred (dashed)")) +
scale_color_paletteer_d(col_pal) +
scale_x_date(NULL,
breaks = breaks_width("6 months"),
labels = label_date_short()) +
labs(color="Country") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank()) +
facet_wrap(~location)
p4Save the plot to a file.
# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/hospit_vacc_rates_facet_country.pdf", width=14, height = 8)1. Mortalities (Case Fatality Rate)?
2. Suggestions? (cases per population density, vaccination rates by country income, deaths by number of beds per capita, etc.
3. Level of reporting in each country...
tweet_embed("https://twitter.com/YannisMarkonis/status/1276211134186602499")Best way to conclude the presentation about our new Bachelor program in Environmental Data Science in @CZUvPraze. Small edit in the programming language used in the original strip of @xkcdComic. Sorry Pearl users :) #DataScience #rstats @FesCuls pic.twitter.com/KBkjLgFNwR
— Yannis Markonis (@YannisMarkonis) June 25, 2020
Please contact me at i.bar@griffith.edu.au for any questions or comments.